Abstract

A  simple  statistical  segmental  approach  to  speech  pattern  modelling,  based 
on  segmental  hidden  Markov  models,  is  proposed  which  addresses  some  of 
the  limitations  of  conventional  hidden  Markov  model  based  methods.  The 
most  important  features  of  the  new  approach  are  the  use  of  an  underlying 
semi-Markov  process  to  model  speech  at  the  segment  level,  rather  than  time- 
synchronous  frame  level,  and  to  enable  improved  segment  duration  modelling, 
and  the  development  of  a  segment  model  in  which  separate  statistical  processes 
are  used  to  characterise  extra-state  and  intra-state  variability,  thus  making 
the  temporal  independence  assumption  more  acceptable  within  a  segment. 
A  basic  mathematical  analysis  of  gaussian  segmental  hidden  Markov  models 
is  presented  and  model  parameter  reestimation  equations  are  derived.  The 
relationship  between  the  new  type  of  model  and  variable  frame  rate  analysis 
and  conventional  gaussian  mixture  based  hidden  Markov  models  is  exposed. 
1 Executive Summary


Potential military applications of advanced speech technology are particularly demanding
in terms of acoustic environment, channel characteristics, speaker variability and
vocabulary flexibility. As higher-level techniques emerge which address these issues,
ever-increasing demands are placed on recogniser performance at the acoustic-phonetic
level. This is the fundamental stage in the recognition process where speech patterns
derived from physical measurements are interpreted in terms of symbols which describe the
basic sounds of the language. Performance at this level clearly depends on the quality
of the speech models which are used.

The most successful automatic speech recognition systems use a statistical formalism,
hidden Markov modelling, to model speech patterns, together with powerful mathematical
methods for model parameter estimation and recognition. Hidden Markov models
(HMMs) are currently the best compromise between mathematical tractability and
acceptability from the perspective of speech science. However, from the latter perspective
many of the assumptions which the HMM formalism makes about the structure of speech
patterns are seriously in error. This constitutes a basic limitation on recogniser
performance.

To  overcome  this  limitation,  new  models  are  needed  which  more  faithfully  characterise 
important  aspects  of  speech  pattern  structure  and  which,  at  the  same  time,  are  amenable 
to  rigorous  mathematical  methods  for  parameter  estimation  and  classification. 

From the viewpoint of speech pattern modelling, the most significant limitations of the
conventional HMM formalism are:

(i) the time-synchronous nature of the modelling, where it is assumed that the acoustic
feature vector at a particular time depends only on the state of the model at that
time and is otherwise independent of the preceding vectors

(ii) the assumption that speech patterns are piecewise stationary with instantaneous
transitions between stationary regions.

This report describes an extension of the HMM formalism which tackles the first
limitation by taking explicit account of the segmental nature of speech patterns. This is
seen as a step towards the development of dynamic segmental statistical models which
are able to model explicitly the dynamic behaviour of speech patterns.

An initial theory of time-asynchronous static segmental HMMs is presented, in which
sources of extra-segmental variation, such as identity of speaker or the particular choice
of acoustic target, are fixed throughout a segment rather than being allowed to vary
time-synchronously as in a conventional HMM. The most important result is that the
conventional HMM parameter estimation algorithm can be extended to this new type of
segmental HMM. The main part of the memorandum is concerned with the derivation
of the extended parameter estimation algorithm and a formal proof of its validity.
2 Introduction


At present the most successful automatic speech recognition systems, in terms of
recognition accuracy, are those which use hidden Markov models (HMMs) to model speech
at the acoustic level and dynamic programming based recognition algorithms which find
the best interpretation of an unknown speech pattern in terms of the output of a
sequence of HMMs. The most recent systems, such as those developed under the DARPA
Spoken Language Systems project and at IBM and Dragon in the USA, and RSRE's
"ARMADA" system in the UK, use HMMs to model speech at the phoneme level in
order to address medium to large vocabularies and to avoid vocabulary-specific training.

This  success  is  due  to  two  factors.  Firstly  HMMs  provide  a  formal  statistical  framework 
which  is  broadly  appropriate  for  modelling  speech  patterns.  This  single  framework  is 
able simultaneously to accommodate the time-varying nature of speech patterns, through
the structure of the underlying Markov process, and the variable segmental structure of
these patterns through the statistical processes which are identified with the states of the
model. Secondly there exist computationally useful and rigorous mathematical methods
for automatically optimising the parameters of a set of HMMs relative to training data,
and for classifying an unknown speech pattern given a set of HMMs. These are the
Baum-Welch algorithm, which is used to adjust the parameters of a set of HMMs in
order  to  (locally)  maximise  the  probability  of  a  given  set  of  training  material  conditioned 
on  these  HMMs,  and  the  Viterbi,  or  One-Pass  Dynamic  Programming  algorithm  which 
computes,  in  a  particular  sense,  the  most  probable  sequence  of  HMMs  given  an  unknown 
speech  pattern. 

These  two  factors  taken  together  (a  broadly  appropriate  formalism  and  the  existence  of 
rigorous  mathematical  methods  for  manipulating  that  formalism)  constitute  a  powerful 
tool  for  speech  pattern  processing.  However  from  the  perspective  of  speech  science  it 
is clear that the assumptions which the HMM formalism imposes on the structure of
speech  patterns  are  inappropriate  in  several  respects. 


(a)  Piecewise  Stationarity  The  HMM  framework  assumes  that  a  speech  pattern  is 
produced  by  a  piecewise  stationary  process,  with  instantaneous  transitions  between 
the  stationary  states.  This  is  clearly  at  variance  with  the  fact  that  speech  patterns 
are  derived  from  signals  produced  by  a  continuously  moving  physical  system  -  the 
vocal  tract. 

(b) Properties of the States In a standard HMM the statistical process associated
with a state is defined by a single probability density function (pdf). This
pdf typically has to accommodate several quite distinct types of variability, for
example:

- Long-term extra-segmental variations, such as speaker sex, identity of speaker,
and long-term prosodic phenomena, which are essentially fixed throughout the
duration of a segment.

- Short-term intra-segment variations which occur once the segment target has
been achieved.

In addition, in reality the configuration of the vocal tract is not even nominally
stationary, for example in the dynamic part of a diphthong or in most consonants,
and this is another source of variability.

A further consideration is the fact that in order for Baum-Welch parameter
reestimation theory to apply, the class of the state output pdf is restricted to
non-parametric discrete distributions (in the case where the front-end processing
includes quantisation to ensure that all observation vectors are drawn from a given
finite set), or mixtures of multivariate gaussian pdfs (see [7]). The extent to which
such pdfs are appropriate for modelling acoustic feature vectors in speech patterns
has been considered by Richter [9].

(c)  The  Independence  Assumption  It  is  assumed  that  the  probability  that  a  given 
acoustic  vector  corresponds  to  a  given  state  of  the  HMM  depends  (directly)  only  on 
the  vector  and  the  state,  and  is  otherwise  independent  of  the  sequence  of  acoustic 
vectors and states which precede and succeed the current vector and state. Thus
the model takes no account of the dynamical constraints of the physical system
which has generated a particular sequence of acoustic data.

Clearly, the problems associated with the independence assumption are exacerbated
by the use of a single density to model all sources of variability (see (b)). For
example,  in  a  speaker-independent  system  in  which  high-order  mixture  densities 
are  used  to  model  inter-speaker  variations,  the  model  assumes  that  each  acoustic 
feature  vector  in  a  sequence  may  have  been  produced  by  a  different  speaker. 

(d)  State  Duration  Because  of  the  Markov  assumption,  state  (and  hence  speech 
segment)  duration  in  a  HMM  conforms  to  a  geometric  pdf  which  assigns  maximum 
probability  to  state  duration  1  and  successively  smaller  probabilities  to  longer 
durations. This is not an appropriate model of speech segment duration.

(e) Model Topologies The basic segmental-sequential structure of the patterns
corresponding to a particular HMM is determined by the topology of the underlying
Markov model (i.e. the number of states and the permitted transitions between
states). In most HMM-based speech recognition systems a common HMM topology
is chosen for all models. However the patterns which are to be modelled typically
exhibit a range of types of sequential structure.
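
Point (d) can be made concrete with a short sketch. Assuming a state whose self-transition probability is `a_ii` (a name and a value of 0.9 chosen here purely for illustration), the duration distribution implied by the Markov assumption is P(d) = a_ii^(d-1) (1 - a_ii), which is largest at d = 1 and decays monotonically.

```python
def geometric_duration_pmf(a_ii, d):
    """Probability of occupying a state for exactly d frames when the
    self-transition probability is a_ii (illustrative parameter name)."""
    if d < 1:
        raise ValueError("duration must be a positive integer")
    # d - 1 self-transitions followed by one exit from the state.
    return (a_ii ** (d - 1)) * (1.0 - a_ii)


if __name__ == "__main__":
    a_ii = 0.9  # assumed value, not taken from the text
    for d in range(1, 6):
        print(d, geometric_duration_pmf(a_ii, d))
```

Whatever value of `a_ii` is chosen, the single most probable duration is always one frame, which is the mismatch with observed speech segment durations that point (d) describes.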

Most of the progress which has been achieved over recent years has resulted from working
within this basic HMM framework. There has been very little work aimed at extending
the  HMM  formalism  in  ways  which  address  the  limitations  listed  above.  One  reason 
for  this  is  the  realisation  of  the  importance  of  the  mathematical  tools  associated  with 
HMMs  and  an  acknowledgement  of  the  need  to  extend  these  mathematical  techniques 
in  parallel  with  any  extension  of  the  basic  formalism. 

An example of the way in which the conventional HMM formalism can be extended is
the work on hidden semi-Markov models (HSMMs) reported in [10], [11] and
[6] in which the HSMM structure enables the geometric model of state duration in a
standard HMM to be replaced by something more appropriate. In a HSMM the underlying
Markov process is replaced with a semi-Markov process in which state duration
is explicitly modelled by state-dependent state duration pdfs. It was shown that the
standard HMM optimisation and recognition algorithms can be extended to HSMMs
with non-parametric, Poisson and Gamma state duration pdfs [6, 10]. Small vocabulary
speech recognition experiments were conducted which showed that HSMMs could
consistently outperform standard HMMs [3, 12].

The essential difference between HMMs and HSMMs is that HMMs are time-synchronous
in  the  sense  that  states  are  associated  with  single  acoustic  vectors,  whereas  in  a  HSMM 
states  are  associated  with  sequences  of  acoustic  vectors.  Hence,  in  addition  to  their 
utility  for  duration  modelling,  HSMMs  offer  a  computationally  useful  framework  for 
more  general  modelling  of  speech  at  the  segment  level. 

The  purpose  of  this  memorandum  is  to  present  a  new  HSMM  based  segment  level 
stochastic  model  which  addresses  some  of  the  limitations  of  HMMs  which  have  been 
listed  above,  and  which  at  the  same  time  is  computationally  useful  in  the  sense  that 
the  existing  HMM  parameter  estimation  and  classification  algorithms  can  be  extended 
to  this  new  class  of  model.  The  basis  of  the  new  model  is  the  notion  of  separating  the 
modelling  of  sources  of  variability  which  apply  above  the  segment  level  from  that  of 
sources  of  variability  which  apply  within  a  segment.  Intuitively,  since  in  the  present 
context  segments  are  sub-phonemic,  variations  due  to  factors  ranging  from  identity  of 
speaker  down  to  the  choice  of  “target  realisation”  of  a  particular  sound  fall  into  the  first 
category,  while  subsequent  variations  around  that  target  fall  into  the  second  category. 
It  is  this  perspective  which  leads  to  the  use  of  terms  such  as  “target”  in  the  discussions 
which  follow. 

The organisation of the memorandum is as follows. Section 3 introduces the terminology
and notation of hidden Markov and semi-Markov models which is necessary for
the development of segmental hidden semi-Markov models. Section 4 presents a general
formal definition of this new type of segmental model. Section 5 introduces the
special case of gaussian segmental HSMMs. A simple example of this type of model is
compared with the corresponding conventional HSMM. The section goes on to present
a basic mathematical analysis of gaussian segmental HSMMs. In section 6 it is shown
that gaussian segmental models can be viewed as an extension of conventional variable
frame-rate analysis in which dynamic programming based variable frame rate analysis
is integrated with Markov model based processing. The relationship with HMMs with
gaussian mixture densities is explored in section 7. Section 8 presents a derivation of
Baum-Welch type reestimation formulae for segmental HSMM parameters.


3  Hidden  Semi-Markov  Models 

3.1  Hidden  Markov  processes 

In the standard hidden Markov model (HMM) based approach to speech pattern modelling
it is assumed that a sequence of observed multi-dimensional acoustic vectors,

$y = y_1, \ldots, y_t, \ldots, y_T$

corresponding to a given speech signal, is a probabilistic function of a hidden state
sequence

$x = x_1, \ldots, x_t, \ldots, x_T$

where each $x_t$ is drawn from a finite set of states $\{\sigma_1, \ldots, \sigma_N\}$. The
sequential and durational statistics of $x$ are determined by a transition probability matrix

$A = [a_{ij}]_{i,j=1,\ldots,N}$

where $a_{ij} = \mathrm{Prob}(x_t = \sigma_j \mid x_{t-1} = \sigma_i)$ is the probability of a
transition from state $\sigma_i$ to state $\sigma_j$, and an initial state probability vector

$\pi = [\pi_i]_{i=1,\ldots,N}$

where $\pi_i = \mathrm{Prob}(x_1 = \sigma_i)$. The pair $M = (\pi, A)$ defines an N state Markov process.
The relationship between the observation vectors $y_t$ and the hidden states $x_t$ is defined
by a set of probability density functions $\{b_i\}_{i=1,\ldots,N}$, where

$b_i(o) = \mathrm{Prob}(y_t = o \mid x_t = \sigma_i)$

is the probability that the observation $o$ is associated with state $\sigma_i$. The triple
$\mathcal{H} = (\pi, A, \{b_i\})$ defines a hidden Markov process. The process is called hidden because it
is not possible to infer unambiguously the exact state sequence which gave rise to a
particular observation sequence.


3.2 Hidden semi-Markov processes

A semi-Markov process is obtained by associating a probability density function $\mathcal{D}_i$,
defined on the set of positive integers, with each state $\sigma_i$ of an N-state Markov process
$M = (\pi, A)$. For $d = 1, 2, 3, \ldots$, $\mathcal{D}_i(d)$ is the probability of occupying state $\sigma_i$ for precisely
$d$ time units. The density $\mathcal{D}_i$ is called the state duration pdf associated with state $\sigma_i$,
and the Markov process $M$ is called the underlying Markov process.

A hidden semi-Markov process is a probabilistic function of a semi-Markov process.
More precisely, an N state hidden semi-Markov model (HSMM), or Variable Duration
HMM [6], is a 4-tuple $S = (\pi, A, \{\mathcal{D}_i\}, \{b_i\})$ where:

• $M = (\pi, A)$ is an N-state Markov model

• $\mathcal{D}_1, \ldots, \mathcal{D}_N$ is a set of N state duration pdfs, $\mathcal{D}_i : \mathbb{N} \to [0,1]$

• $b_1, \ldots, b_N$ is a set of N state output pdfs, $b_i : \mathbb{R}^d \to [0,1]$

where $\mathbb{N}$ and $\mathbb{R}^d$ denote the positive integers and real d dimensional space respectively.

Intuitively one can visualise a hidden semi-Markov process as follows. At some time
$t = 1$ the process enters state $x_1 = \sigma_i$, chosen randomly according to the initial state
probability vector $\pi$. A duration $d_1$ is chosen randomly according to the state duration
pdf $\mathcal{D}_i$, and a sequence $y_1, \ldots, y_{d_1}$ of $d_1$ acoustic vectors is generated randomly and
independently according to the state output pdf $b_i$. The process then moves from state $\sigma_i$
to state $\sigma_j$ according to the state transition probability matrix $A$. In general, at time
$t = d_1 + d_2 + \cdots + d_{m-1} + 1$ the process enters state $x_t = \sigma_i$. As in the case of the first
state, a duration $d_m$ is chosen randomly according to the state duration pdf $\mathcal{D}_i$, and a
sequence $y_t, \ldots, y_{t+d_m-1}$ of $d_m$ acoustic vectors is generated randomly and independently
according to the state output pdf $b_i$.
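
The generative description above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `sample_hsmm`, the callback arguments `sample_duration` and `sample_output`, and the truncation of the final segment at `total_frames` are assumptions made here and are not part of the memorandum.

```python
import random


def sample_hsmm(pi, A, sample_duration, sample_output, total_frames, rng):
    """Generate (states, observations) from a hidden semi-Markov process.

    pi: initial state probabilities; A: transition matrix (row i = from state i).
    sample_duration(i, rng) returns a positive integer drawn from the duration
    pdf of state i; sample_output(i, rng) returns one observation drawn from
    the output pdf of state i. Both callbacks are hypothetical helpers.
    """
    n_states = len(pi)
    states, observations = [], []
    # Enter the first state according to the initial state probability vector.
    state = rng.choices(range(n_states), weights=pi)[0]
    while len(observations) < total_frames:
        # One duration per state occupancy, then that many independent outputs.
        duration = sample_duration(state, rng)
        for _ in range(duration):
            states.append(state)
            observations.append(sample_output(state, rng))
        # Move to the next state according to the transition matrix.
        state = rng.choices(range(n_states), weights=A[state])[0]
    # Truncate the last occupancy so that exactly total_frames are returned.
    return states[:total_frames], observations[:total_frames]
```

For example, a two-state process with no self-transitions, durations uniform on 3 to 6 frames and scalar gaussian outputs could be sampled with `sample_hsmm([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]], lambda i, r: r.randint(3, 6), lambda i, r: r.gauss(float(i), 0.1), 50, random.Random(0))`.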

It is straightforward to show that the principle of dynamic programming can be extended
from Markov to semi-Markov processes. Consequently the standard dynamic programming
based recognition algorithms can be extended from HMMs to HSMMs. In addition
it has been demonstrated that Baum's theorem (and hence the Baum-Welch parameter
estimation algorithm) can be extended to HSMMs with discrete, Poisson or Gamma
state duration pdfs ([10, 6]). In all cases the need to explicitly consider times $t - \delta$
($\delta = 1, 2, \ldots, d_{max}$) during HSMM based computations leads to an increase in
computational load relative to HMMs; however, HSMMs still provide a computationally useful
formalism.


3.3  Advantages  of  Hidden  Semi-Markov  Models 

In  the  past,  HSMMs  have  primarily  been  used  to  remedy  the  limitations  of  HMMs  with 
respect  to  speech  segment  duration  modelling,  and  have  not  been  used  to  address  any 
of  the  other  limitations  of  HMMs  which  are  listed  in  the  introduction.  Consequently, 
because the improvements in recognition accuracy which result from better duration
modelling  are  generally  relatively  modest  and  the  increase  in  computational  load  is 
relatively  high,  there  has  been  little  recent  work  in  this  area.  The  objective  of  this 
memorandum  is  to  show  that,  leaving  the  duration  modelling  capabilities  of  HSMMs 
aside,  the  segment  based  formalism  provided  by  HSMMs  can  be  exploited  to  address 
some  of  the  other  limitations  of  HMMs. 


4  Segmental  Hidden  Markov  Models 

This section proposes a segmental model of speech which is an extension of the
conventional HSMM described above. The model is motivated by the need to deal explicitly
and separately with the different types of variability which are accommodated in the state
output pdf of a conventional HMM or HSMM, thereby making the independence
assumption more realistic. Hence the new model explicitly addresses points (b) and (d)
of the introduction and implicitly addresses point (c).

In a conventional HSMM the stochastic process associated with state $\sigma_i$ is defined by a
state output pdf

$b_i : \mathbb{R}^n \to [0,1]$.

In the associated model of speech pattern production, at each time $t$ an observation
vector $y_t$ is produced randomly according to the pdf $b_i$. The vector $y_t$ clearly depends on
the state $\sigma_i$ but is otherwise independent of the observations $y_1, \ldots, y_{t-1}$ which preceded
it. However, it has already been noted that the pdf $b_i$ typically accommodates several
different types of variability, including variations in the target for a given sound (both
intra-speaker and inter-speaker) and natural variations which occur once the target has
been chosen. Because all of these types of variability are modelled by a single pdf and
the observation vectors in the sequence are generated independently (according to that pdf)
there is nothing to prevent successive observation vectors corresponding to quite different
sets of extra-segmental factors, such as different speakers.

The proposed model overcomes this problem by using separate processes to model
extra-state and intra-state variations.


4.1  Definition  of  the  model 

In the state model described below, extra-state variations associated with state $\sigma_i$ are
modelled by a pdf $b_i$ called the state target pdf. Arrival at state $\sigma_i$ causes a single output
to be generated by the process associated with this pdf. This output, which will be
called a target, is a pdf $v$ which can be regarded as modelling within-state variability
once all sources of extra-state variability have been fixed. Thus on entering state $\sigma_i$, a
state duration $d$ is chosen randomly according to the state duration pdf $\mathcal{D}_i$ and a target
$v$ is chosen randomly according to the pdf $b_i$. A sequence of $d$ observation vectors is then
produced, with each individual observation being generated randomly and independently
of its successors according to the pdf $v$.

More formally, the stochastic process associated with state $\sigma_i$ is governed by a probability
density function

$b_i : \mathcal{P}^n \to [0,1]$

where $\mathcal{P}^n$ denotes a subset of the set of probability density functions defined on
n-dimensional space $\mathbb{R}^n$.

In other words, in the new type of model a target is defined to be a pdf $v$ which can be
thought of as modelling variations in the acoustic pattern which occur once the speaker
and all other sources of extra-state variability have been selected. This target is fixed
throughout a particular state occupancy. The state target pdf $b_i$ specifies the probability
of any particular target given state $\sigma_i$.

Hence although the observation vectors generated by such a state model
are still independent samples from the same distribution $v$, in this case the distribution
$v$ only models variations which occur given a fixed acoustic target. All of the vectors
which are generated during a particular state occupancy are constrained to correspond
to the same target.

Therefore, formally, an N-state segmental HSMM is a 5-tuple $M = (\pi, A, \{\mathcal{D}_i\}, \mathcal{P}^n, \{b_i\})$
where

• $(\pi, A, \{\mathcal{D}_i\})$ $(i = 1, \ldots, N)$ is an N-state semi-Markov model

• $\mathcal{P}^n$ is a set of target pdfs defined on n-dimensional space

• $b_i : \mathcal{P}^n \to [0,1]$ is a state target pdf defined on $\mathcal{P}^n$


In the above notation, a state of a segmental HSMM is a triple

$\sigma = (\mathcal{P}^n, b, \mathcal{D})$

Given a sequence of observations

$y = y_1, \ldots, y_T$

the joint probability of the sequence $y$ and a particular target pdf $v : \mathbb{R}^n \to [0,1]$ given
$\sigma$ is therefore given by:

$P_\sigma(y, v) = \mathcal{D}(T)\, b(v) \prod_{t=1}^{T} v(y_t)$  (1)


5  Gaussian  Segmental  HSMMs 

Suppose that a target is defined to be any gaussian pdf defined on n-dimensional space
$\mathbb{R}^n$ with fixed variance $\tau$. Given that the variance is fixed, a target is defined uniquely
by its mean and hence in this case $\mathcal{P}^n$ is equal to $\mathbb{R}^n$. For a given state $\sigma_i$ let the state
target pdf be a gaussian pdf defined on $\mathcal{P}^n = \mathbb{R}^n$ with mean $\mu_i$ and variance $\gamma_i$.


5.1  A  simple  example 

To illustrate this, consider the simple case of a 1-state 1-dimensional model with state
mean $\mu_1 = 0$ and variance $\gamma_1 = 6$, and with fixed target variance $\tau = 0.5$. In addition,
suppose that state duration follows a Poisson distribution with mean state duration
300. Figure 1 shows a sequence of 1100 random observations generated by such a
process in the manner described in section 4.1. The figure clearly shows 3 separate state
occupancies. At the start of each occupancy a target value and a duration are chosen
and the appropriate number of observations are generated, at random, from a relatively
tight pdf centered about the target.

Now  consider  the  result  of  trying  to  model  this  as  a  conventional  1-state  hidden  semi- 
Markov  process.  The  latter  assumes  that  all  of  the  observations  are  generated  randomly 
and  independently  by  a  single  gaussian  process.  It  is  straightforward  to  show  that  the 
mean  of  this  process  is  equal  to  0,  the  mean  of  the  state  target  pdf  of  the  segmental 
process, and its variance is 6.5, the sum of the variances of the state target pdf and
individual  target  pdfs  of  the  segmental  process.  Figure  2  shows  a  sequence  of  1100 
observations  from  such  a  process  given  the  correct  Poisson  duration  statistics.  State 
transitions  occur  at  the  same  times  as  in  figure  1,  however  all  of  the  structure  apparent 
in  figure  1  which  shows  the  separate  state  occupancies  has  been  lost. 
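
The contrast between the two processes can be reproduced with a small simulation. The sketch below assumes the parameter values quoted above (state mean 0, state target variance 6, target variance 0.5, mean duration 300, 1100 observations); the function names, the Poisson sampler (Knuth's multiplication method) and the use of pooled within-segment variance as a summary statistic are choices made here for illustration.

```python
import math
import random


def sample_poisson(mean, rng):
    """Poisson sample by Knuth's multiplication method (fine for moderate means)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate(total, mu, gamma, tau, mean_duration, segmental, rng):
    """Return (observations, segment_start_indices) for a 1-state process."""
    obs, starts = [], []
    while len(obs) < total:
        starts.append(len(obs))
        duration = max(1, sample_poisson(mean_duration, rng))
        # Segmental case: one target per state occupancy, drawn from the
        # state target pdf, then tight scatter around that target.
        target = rng.gauss(mu, math.sqrt(gamma))
        for _ in range(duration):
            if segmental:
                obs.append(rng.gauss(target, math.sqrt(tau)))
            else:
                # Conventional case: every frame drawn from one pooled gaussian
                # whose variance is the sum of the two variances.
                obs.append(rng.gauss(mu, math.sqrt(gamma + tau)))
    return obs[:total], starts


def pooled_within_variance(obs, starts):
    """Variance of observations about their own segment means, pooled over segments."""
    bounds = starts + [len(obs)]
    sq_dev, dof = 0.0, 0
    for a, b in zip(bounds[:-1], bounds[1:]):
        seg = obs[a:b]
        mean = sum(seg) / len(seg)
        sq_dev += sum((v - mean) ** 2 for v in seg)
        dof += len(seg) - 1
    return sq_dev / dof


if __name__ == "__main__":
    rng = random.Random(1)
    for segmental in (True, False):
        obs, starts = simulate(1100, 0.0, 6.0, 0.5, 300, segmental, rng)
        print(segmental, round(pooled_within_variance(obs, starts), 2))
```

In the segmental case the within-segment variance should be close to the target variance, whereas in the conventional case it should be close to the pooled variance, which is one way of quantifying the loss of structure between the two figures.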
Figure 1: Observations from 1 state segmental hidden semi-Markov process with Poisson
state duration statistics


Figure  2:  Observations  from  1  state  hidden  semi-Markov  process  with  Poisson  state 
duration  statistics 


5.2  Mathematical  analysis  of  Gaussian  Segmental  HMMs 


The purpose of this section is to present the basic equations which are necessary for
the study of gaussian segmental HSMMs. In order to focus attention onto those aspects
which are relevant to the segmental nature of the models, three simplifying assumptions
have been made. First, the precise form of the state duration pdf associated with a
particular state $\sigma_i$ is not specified. This pdf is simply denoted by $\mathcal{D}_i$, and it is assumed
that it is independent of the parameters of the segment model, namely $\mu_i$, $\gamma_i$ and $\tau$
$(i = 1, \ldots, N)$. Second it is assumed that all observations are 1-dimensional. The latter
assumption is also unnecessary, and it will be seen that the arguments can be extended
to multi-dimensional observations, but it is made for the reasons given above. The
final simplifying assumption is that the underlying Markov model is strictly left-right.
Again this is not necessary but it will significantly simplify notation, particularly in the
derivation of the parameter reestimation formulae in section 8.


5.2.1  Analysis  of  the  State  Model 

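In the above notation, a state $\sigma$ of a gaussian segmental HSMM is a triple

$\sigma = (\mathcal{P}, \mathcal{N}_{(\mu,\gamma)}, \mathcal{D})$

where $\mathcal{P}$ is the set of pdfs defined on $\mathbb{R}$ of the form $\mathcal{N}_{(x,\tau)}$ $(x \in \mathbb{R})$, and $\mathcal{D}$ denotes an
appropriate state duration pdf. Given a sequence of observations

$y = y_1, \ldots, y_T$

the joint probability of the observation sequence $y$ and a particular target¹ $c$ given state
$\sigma$ is given by

$P_\sigma(y, c) = \mathcal{D}(T)\, \mathcal{N}_{(\mu,\gamma)}(c) \prod_{t=1}^{T} \mathcal{N}_{(c,\tau)}(y_t)$  (2)

and the probability of the sequence $y$ given $\sigma$ is

$P_\sigma(y) = \int_c P_\sigma(y, c)\, dc$  (3)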
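
Equations (2) and (3) can be evaluated numerically for the 1-dimensional case. The following is a minimal sketch: the function names, the argument `duration_prob` (standing for the value of the state duration pdf at the segment length T) and the use of the trapezoidal rule over a window of ten standard deviations of the state target pdf are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Univariate gaussian density with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def joint_prob(y, c, mu, gamma, tau, duration_prob):
    """Equation (2): duration term * state target pdf at c * product of
    target pdf values at each observation."""
    p = duration_prob * normal_pdf(c, mu, gamma)
    for y_t in y:
        p *= normal_pdf(y_t, c, tau)
    return p


def marginal_prob(y, mu, gamma, tau, duration_prob, half_width=10.0, steps=20000):
    """Equation (3) approximated by the trapezoidal rule over
    c in [mu - half_width*sqrt(gamma), mu + half_width*sqrt(gamma)]."""
    lo = mu - half_width * math.sqrt(gamma)
    hi = mu + half_width * math.sqrt(gamma)
    h = (hi - lo) / steps
    total = 0.5 * (joint_prob(y, lo, mu, gamma, tau, duration_prob)
                   + joint_prob(y, hi, mu, gamma, tau, duration_prob))
    for k in range(1, steps):
        total += joint_prob(y, lo + k * h, mu, gamma, tau, duration_prob)
    return total * h
```

For long segments the product in `joint_prob` will underflow, so a practical implementation would presumably work with log probabilities; the direct form is kept here only because it mirrors the equations.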

An alternative to the "full probability" criterion of equation (3), which is more
amenable to analysis, is to consider the joint probability $P_\sigma(y, \hat{c})$, where $\hat{c}$ is the value of
the target $c$ which maximises $P_\sigma(y, c)$. Define

$\hat{P}_\sigma(y) = \max_c P_\sigma(y, c)$  (4)

$\hat{c} = \arg\max_c P_\sigma(y, c)$  (5)

Claim

(6)

¹ Here the term "target" is being used to refer either to the gaussian pdf with mean $c$ and
fixed variance $\tau$ or to the mean value $c$.


5.2.2 Analysis of Multi-State Models

Now consider an N state segmental HMM $M$, where the ith state of $M$ is defined by

$\sigma_i = (\mathcal{P}, \mathcal{N}_{(\mu_i,\gamma_i)}, \mathcal{D}_i)$

To simplify mathematical notation in what follows it will be assumed that the underlying
Markov process for $M$ is strictly left-right, in the sense that if $t > s$, $x_t = \sigma_j$ and $x_s = \sigma_i$
then $j \geq i$. This assumption is not necessary, but it enables a state sequence $x$ to be
written in the form

$x = d_1 \circ \sigma_1,\ d_2 \circ \sigma_2,\ \ldots,\ d_N \circ \sigma_N$

where $d_i \circ \sigma_i$ denotes duration $d_i$ in state $\sigma_i$. Without this assumption it is necessary
to introduce an extra level of indirection, to map the mth state visited in the sequence
$x$ onto the correct $\sigma_i$ and to account for multiple occurrences of states in the sequence $x$.
This is straightforward, but the additional notation which is required obscures the basic
simplicity of the ideas which follow.

The joint probability $P(y, x \mid M)$ of the observation sequence $y$ and the state sequence $x$
given the model $M$ is given by

$P(y, x \mid M) = \prod_{i=1}^{N} a_{i-1,i}\, P_{\sigma_i}(y_{t_{i-1}+1}^{t_i})$  (10)

where $y_s^t = y_s, y_{s+1}, \ldots, y_t$ and $t_i$ is the largest value of $t$ for which $x_t = \sigma_i$ $(i \geq 1)$, $t_0 = 0$.
The probability of $y$ given the model $M$ is then given by:

$P(y \mid M) = \sum_x P(y, x \mid M)$  (11)

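By analogy, define

$\hat{P}(y, x \mid M) = \prod_{i=1}^{N} a_{i-1,i}\, \hat{P}_{\sigma_i}(y_{t_{i-1}+1}^{t_i})$  (12)

$\hat{P}(y \mid M) = \sum_x \hat{P}(y, x \mid M)$  (13)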

Hence $\hat{P}(y, x \mid M)$ is similar to the joint probability $P(y, x \mid M)$ except that in the
computation of $\hat{P}(y, x \mid M)$ the evaluation of the probability of a particular subsequence of $y$
given a state $\sigma_i$ is based on the optimal target $\hat{c}$. Note that since $\hat{c}$ depends on the state
sequence $x$ and the state $\sigma_i$ it is more correct to write


$\hat{c}_i(x) = \dfrac{\tau\mu_i + \gamma_i \sum_{t=t_{i-1}+1}^{t_i} y_t}{\tau + d_i\gamma_i}$  (14)

The analysis which follows, and in particular the derivation of reestimation formulae in
section 8, will focus on the quantities $\hat{P}(y, x \mid M)$ and $\hat{P}(y \mid M)$ rather than $P(y, x \mid M)$
and $P(y \mid M)$.


6 Relationship with Variable Frame Rate Analysis

In this section it will be shown that the segmental HMM based analysis proposed in this
memorandum can be regarded as an extension and integration of conventional Variable
Frame Rate (VFR) analysis and hidden Markov modelling.


6.1 Variable Rate Analysis


Variable  frame  rate  (VFR)  analysis  is  a  method  for  data-rate  reduction  which  has  been 
shown  to  give  improved  performance  over  fixed  frame  rate  analysis  for  automatic  speech 
recognition  [8].  In  its  simplest  form  VFR  is  used  to  remove  vectors  from  an  observation 
sequence.  A  distance  is  computed  between  the  current  observation  vector  and  the  most 
recently  retained  vector,  and  the  current  vector  is  discarded  if  this  distance  falls  below  a 
threshold  T.  When  a  new  observation  vector  causes  the  distance  to  exceed  the  threshold, 
the  new  vector  is  kept  and  becomes  the  most  recently  retained  vector.  VFR  analysis 
replaces  sequences  of  similar  vectors  with  a  single  vector,  and  hence  reduces  the  amount 
of  computation  required  for  recognition. 
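The basic procedure described above can be sketched as follows. This is an illustrative implementation only: it assumes a euclidean distance, treats a distance exactly equal to the threshold as exceeding it, and also returns the number of frames which each retained vector replaces. The name `vfr_reduce` is hypothetical.

```python
import math


def vfr_reduce(vectors, threshold):
    """Basic variable frame rate reduction.

    vectors   : sequence of equal-length tuples (acoustic vectors)
    threshold : a vector is discarded if its distance to the most recently
                retained vector is below this value
    Returns (retained, counts), where counts[k] is the number of input
    frames represented by retained[k].
    """
    retained = []
    counts = []
    for v in vectors:
        if retained and math.dist(v, retained[-1]) < threshold:
            counts[-1] += 1  # discard v, but remember that it was replaced
        else:
            retained.append(v)  # v becomes the most recently retained vector
            counts.append(1)
    return retained, counts
```

For example, `vfr_reduce([(0.0,), (0.1,), (0.2,), (1.0,), (1.05,)], 0.5)` keeps the first and fourth vectors, with counts 3 and 2.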

What  is  interesting  is  that  VFR  analysis  can  also  improve  recognition  accuracy  [8]. 
There  are  a  number  of  possible  explanations  for  this: 

•  by  discarding  vectors  from  relatively  stationary  regions  of  the  speech  pattern,  VFR 
focusses  the  recognition  process  onto  the  dynamic  regions,  which  are  important  for 
classification 

•  vectors in the relatively stationary regions of speech patterns are highly correlated, contrary to the assumption of independence which is part of the HMM formalism. Discarding vectors in these regions results in observation sequences which are more consistent with the formalism

•  if  a  count  of  the  number  of  frames  which  each  retained  vector  replaces  is  appended 
to  that  vector,  then  some  implicit  duration  modelling  is  incorporated  into  the 
recognition  process 

6.2  Improvements  to  the  basic  VFR  algorithm 

The  basic  VFR  algorithm  described  above  can  be  improved  in  a  number  of  ways: 

6.2.1  Rather than replacing a sequence of acoustic vectors $y_s, \ldots, y_t$ with $y_s$, the first vector in the sequence, it should be replaced with some form of average $\bar{y}_s^t$ taken over the sequence.

6.2.2  For a finite sequence $y = y_1, \ldots, y_T$ the "left-right" threshold based segmentation method used in the basic VFR algorithm should be replaced with a "global" dynamic programming based segmentation algorithm such as that described in [4]. The dynamic programming based method does not rely on a threshold. It is used to partition the sequence $y$ into a sequence of $M$ subsequences $y_1^{t_1}, \ldots, y_{t_{i-1}+1}^{t_i}, \ldots, y_{t_{M-1}+1}^{t_M}$ ($1 \leq t_1 < \ldots < t_M = T$) such that some criterion

$$\sum_{i=1}^{M} D(y_{t_{i-1}+1}^{t_i}) \tag{15}$$

is minimised. The quantity $D(y_{t_{i-1}+1}^{t_i})$ is typically a distortion measure on the sequence $y_{t_{i-1}+1}^{t_i}$, for example the sum of euclidean distances between vectors in the sequence and the sequence mean.

6.2.3  In the context of Markov model based speech pattern processing it is clearly sub-optimal to segment the sequence of acoustic observation vectors and discard information during VFR analysis, and then to perform a second state-level segmentation. The segmentation of the observation sequence during VFR analysis should be integrated with the state-level segmentation performed in the model based analysis.
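The global segmentation of 6.2.2 can be sketched with a standard dynamic programming recursion. The sketch below is not the algorithm of [4]; it assumes 1-dimensional observations and takes the distortion of a segment to be the sum of squared deviations from the segment mean. The name `dp_segment` is hypothetical.

```python
def dp_segment(y, num_segments):
    """Partition y into num_segments contiguous segments of minimum total distortion.

    Returns (ends, cost), where ends = [t_1, ..., t_M] with t_M = len(y)
    and cost is the minimised sum of within-segment squared deviations.
    """
    T = len(y)
    if not 1 <= num_segments <= T:
        raise ValueError("need 1 <= num_segments <= len(y)")

    # Prefix sums give the distortion of any segment in constant time.
    s1, s2 = [0.0], [0.0]
    for v in y:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def distortion(a, b):
        """Sum of squared deviations from the mean of y[a:b], with b > a."""
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    inf = float("inf")
    cost = [[inf] * (T + 1) for _ in range(num_segments + 1)]
    back = [[0] * (T + 1) for _ in range(num_segments + 1)]
    cost[0][0] = 0.0
    for m in range(1, num_segments + 1):
        for t in range(m, T + 1):
            for s in range(m - 1, t):  # s is the end of the previous segment
                c = cost[m - 1][s] + distortion(s, t)
                if c < cost[m][t]:
                    cost[m][t] = c
                    back[m][t] = s

    # Trace back the segment end points.
    ends = [T]
    t = T
    for m in range(num_segments, 1, -1):
        t = back[m][t]
        ends.append(t)
    ends.reverse()
    return ends, cost[num_segments][T]
```

The recursion examines every admissible end point for each segment, so no threshold is needed; the price is a computational cost which grows with $M T^2$.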

6.3  Interpretation  of  VFR  analysis  in  terms  of  segmental 
HMMs 

It will be shown that extending the basic VFR analysis algorithm in the ways described above leads naturally to a segmental HMM based analysis. Suppose that $\mathcal{M} = (\pi, A, \{b_i\})$ is a HMM, with $b_i = \mathcal{N}_{(\mu_i, \gamma_i)}$, and that $y = y_1, \ldots, y_t, \ldots, y_T$ is a sequence of acoustic vectors in $R^n$. In a dynamic programming based VFR scheme of the type alluded to in 6.2.2 above, dynamic programming is used to find a partition of the sequence $y$ into $M$ subsequences such that

$$\sum_{i=1}^{M} D(y_{t_{i-1}+1}^{t_i}) \tag{16}$$

is minimised.

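Taking account of (6.2.1), following VFR analysis the sequence $y$ would be represented by the sequence $\bar{y}_1^{t_1}, \ldots, \bar{y}_{t_{i-1}+1}^{t_i}, \ldots, \bar{y}_{t_{M-1}+1}^{t_M}$, where $\bar{y}_{t_{i-1}+1}^{t_i}$ denotes some form of average over the sequence $y_{t_{i-1}+1}^{t_i}$.

During subsequent HMM based processing, dynamic programming is used again to find a state sequence $x = x_1, \ldots, x_M$, relative to the HMM $\mathcal{M}$, such that the probability

$$\prod_{i=1}^{M} a_{x_{i-1}x_i} D_{x_i}(d_i)\, b_{x_i}(\bar{y}_{t_{i-1}+1}^{t_i}) \tag{17}$$

is maximised. Here $D_{x_i}$ is a state dependent duration pdf which is applied to the VFR count $d_i$.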

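The goal of 6.2.3 is that ideally the two equations (16) and (17) should be optimised simultaneously rather than separately. To achieve this it is necessary to make assumptions about the form of the distortion measure $D$. Suppose that

$$D(y_{t_{i-1}+1}^{t_i}) = \sum_{t=t_{i-1}+1}^{t_i} D_{EUC}(y_t, \bar{y}_{t_{i-1}+1}^{t_i}) \tag{18}$$

where $D_{EUC}$ denotes the squared euclidean metric. Then, since

$$D_{EUC}(y_t, \bar{y}_{t_{i-1}+1}^{t_i}) = K_1 \log \mathcal{N}_{(\bar{y}_{t_{i-1}+1}^{t_i},\, 1)}(y_t) + K_2 \tag{19}$$

where $K_1 < 0$ and $K_2$ are constants, minimising equation (16) is equivalent to maximising the quantity

$$P(t_1, \ldots, t_i, \ldots, t_M) = \prod_{i=1}^{M} \prod_{t=t_{i-1}+1}^{t_i} \mathcal{N}_{(\bar{y}_{t_{i-1}+1}^{t_i},\, 1)}(y_t) \tag{20}$$

Equations (17) and (20) can now be combined to give an evaluation criterion for a VFR analysis scheme which satisfies 6.2.1, 6.2.2 and 6.2.3:

$$\prod_{i=1}^{M} a_{x_{i-1}x_i} D_{x_i}(d_i)\, b_{x_i}(\bar{y}_{t_{i-1}+1}^{t_i}) \prod_{t=t_{i-1}+1}^{t_i} \mathcal{N}_{(\bar{y}_{t_{i-1}+1}^{t_i},\, 1)}(y_t) \tag{21}$$

But equation (21) has precisely the same form as equation (12), with $\tau_i = 1$, for all $i$, and

$$\bar{y}_{t_{i-1}+1}^{t_i} = \hat{c}_{x,i} \tag{22}$$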

In other words, replacing the basic VFR analysis procedure described in section 6.1 with the obvious dynamic programming based method and then integrating this with the higher-level Markov model based processing leads naturally to the type of gaussian segmental HMM based analysis which is proposed in this memorandum. In this sense segmental HMMs can be regarded as a natural extension and integration of VFR analysis and HMM-based analysis.


7  Relationship  with  multi-modal  gaussian  mixture 
densities 

One of the classes of state output pdf which is frequently used in conventional hidden Markov modelling is the class of gaussian mixture densities. In such a system the state output pdf $b_i$ associated with the $i$th state has the form

$$b_i(o) = \sum_{j=1}^{J} w_j\, \mathcal{N}_{(\mu_j, \tau_j)}(o) \tag{23}$$

for any observation $o$, where $\sum_{j=1}^{J} w_j = 1$. There is also a continuous analogue of (23) of the form

$$b_i(o) = \int_j w(j)\, \mathcal{N}_{(\mu_j, \tau_j)}(o)\, dj \tag{24}$$

where $\int_j w(j)\, dj = 1$. Parameter reestimation formulae for models based on (23) and (24) have been established in [7] and [5], and in [7] respectively.

Gaussian mixture state models are used to compensate for the fact that in reality the set of observations associated with a particular state will not generally be consistent with a single gaussian process. This is particularly true in cases where the models in question are used to characterise speech from a number of speakers. Thus, gaussian mixtures are typically used to model broad sources of extra-segmental variability and hence, from the viewpoint of this memorandum, they exacerbate the problems associated with the independence assumption within a state.

The  segment  model  proposed  here  is  clearly  related  to  (24),  however  in  the  new  type 
of  model  a  single  component  of  the  continuous  mixture  is  chosen  on  entering  a  state 
and  all  observations  emitted  during  a  particular  state  occupancy  are  drawn  from  that 
component.  In  the  case  of  (24)  a  different  component  can  be  used  to  explain  each 
individual  observation.  Thus,  the  new  type  of  model  can  be  regarded  as  a  continuous 
gaussian  mixture  in  which  all  observations  corresponding  to  a  particular  state  occupancy 
are  constrained  to  come  from  the  process  associated  with  a  single  component  of  that 
mixture. 
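The distinction can be made concrete by simulation. The sketch below is illustrative only: it assumes 1-dimensional observations, a gaussian weighting $\mathcal{N}_{(\mu,\gamma)}$ over component means and a common component variance $\tau$, and compares drawing one component per state occupancy with drawing a fresh component for every observation. All function names are hypothetical.

```python
import random


def sample_segmental(d, mu, gamma, tau, rng):
    """One target per state occupancy: all d observations share it."""
    c = rng.gauss(mu, gamma ** 0.5)
    return [rng.gauss(c, tau ** 0.5) for _ in range(d)]


def sample_framewise_mixture(d, mu, gamma, tau, rng):
    """Continuous mixture: a new component is drawn for every observation."""
    return [rng.gauss(rng.gauss(mu, gamma ** 0.5), tau ** 0.5) for _ in range(d)]


def mean_within_segment_variance(sampler, n_segments, d, mu, gamma, tau, seed=0):
    """Average unbiased sample variance inside segments drawn by `sampler`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_segments):
        seg = sampler(d, mu, gamma, tau, rng)
        mean = sum(seg) / d
        total += sum((v - mean) ** 2 for v in seg) / (d - 1)
    return total / n_segments
```

Under these assumptions the within-segment variance of the segmental sampler is close to $\tau$, whereas for the frame-wise mixture it is close to $\gamma + \tau$, because the extra-segmental variability is re-injected at every frame.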


8  Parameter  Re-estimation  for  Segmental  HMMs 

The analysis presented in this section is concerned with the derivation of reestimation formulae for the parameters of the state segment models, namely $\mu_i$, $\gamma_i$ and $\tau_i$. The reestimation formulae for the initial state probabilities $\pi_i$ and state transition probabilities $a_{ij}$ are the same as those presented in [10] and are not re-derived here. Similarly, no precise form for the state duration pdfs is assumed other than that they should be independent of the parameters of the state segment models. Reestimation formulae for non-parametric discrete, poisson and gamma state duration pdfs are given in [10, 6].

As in section 5.2, the derivations presented below are simplified by assuming that all observation vectors $y_t$ are 1-dimensional and that the underlying Markov model is strictly left-right. Again it is emphasised that these assumptions are not necessary but are made in order to focus attention onto those aspects of the mathematics which are directly relevant to the segmental nature of the models and to reduce notational complexity.

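The analysis focuses on the quantity $\hat{P}(y \mid \mathcal{M})$. Hence, given a model $\mathcal{M}$, the goal of reestimation is to derive a new model $\bar{\mathcal{M}}$ such that

$$\hat{P}(y \mid \bar{\mathcal{M}}) \geq \hat{P}(y \mid \mathcal{M}) \tag{25}$$

As for conventional HMMs and HSMMs [2, 7] it is convenient at this point to introduce a function $Q(\mathcal{M}, \cdot)$ of $\bar{\mathcal{M}}$, called the auxiliary function, defined by,

$$Q(\mathcal{M}, \bar{\mathcal{M}}) = \sum_{x} \hat{P}(y, x \mid \mathcal{M}) \log \hat{P}(y, x \mid \bar{\mathcal{M}}) \tag{26}$$

The auxiliary function has the following property

Claim

If $Q(\mathcal{M}, \bar{\mathcal{M}}) \geq Q(\mathcal{M}, \mathcal{M})$ then $\hat{P}(y \mid \bar{\mathcal{M}}) \geq \hat{P}(y \mid \mathcal{M})$

Proof

The proof is the same as in [2, 7]

It follows that in order to find a model $\bar{\mathcal{M}}$ which satisfies equation (25) it is sufficient to find $\bar{\mathcal{M}}$ such that $Q(\mathcal{M}, \bar{\mathcal{M}}) \geq Q(\mathcal{M}, \mathcal{M})$. In particular it is sufficient to find a model $\bar{\mathcal{M}}$ which maximises $Q(\mathcal{M}, \bar{\mathcal{M}})$. This maximum is obtained by setting the partial derivatives of $Q(\mathcal{M}, \bar{\mathcal{M}})$ with respect to the parameters of $\bar{\mathcal{M}}$ equal to zero and solving the resulting equations.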

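8.1 Derivation of the Reestimation Formulae

Claim

Let $y$ be a sequence of observation vectors and let $\mathcal{M}$ be a gaussian segmental HMM as in section 5.2.2. Let $\bar{\mathcal{M}}$ be the gaussian segmental HMM with parameters defined as follows,

$$\bar{\mu}_i = \frac{\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \sum_{t=t_{i-1}+1}^{t_i} y_t}{\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M})\, d_i} \tag{27}$$

$$\bar{\gamma}_i = \frac{\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) (\hat{c}_{x,i} - \bar{\mu}_i)^2}{\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M})} \tag{28}$$

$$\bar{\tau}_i = \frac{\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \sum_{t=t_{i-1}+1}^{t_i} (\hat{c}_{x,i} - y_t)^2}{\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M})\, d_i} \tag{29}$$

where $S_i = \{x : x_t = \sigma_i \text{ for some } t\}$ and

$$\hat{c}_{x,i} = \frac{\bar{\tau}_i \bar{\mu}_i + \bar{\gamma}_i \sum_{t=t_{i-1}+1}^{t_i} y_t}{\bar{\tau}_i + d_i \bar{\gamma}_i} \tag{30}$$

Then provided that

(i) $\bar{\gamma}_i > \bar{\tau}_i$ for all $i$, and

(ii) the sequence $y = y_1, \ldots, y_T$ is not constant

$\hat{P}(y \mid \bar{\mathcal{M}}) \geq \hat{P}(y \mid \mathcal{M})$.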

Proof

The arguments follow those in [2] and [7].

From the previous discussion it is sufficient to show that the model $\bar{\mathcal{M}}$ defined above maximises $Q(\mathcal{M}, \bar{\mathcal{M}})$ as a function of $\bar{\mathcal{M}}$. The proof is divided into three stages:

•  $\bar{\mathcal{M}}$ is a critical point of $Q(\mathcal{M}, \bar{\mathcal{M}})$

•  $Q(\mathcal{M}, \bar{\mathcal{M}})$ is strictly concave in $\bar{\mathcal{M}}$

•  $Q(\mathcal{M}, \bar{\mathcal{M}}) \to -\infty$ as $\bar{\mathcal{M}}$ approaches the boundary of the parameter space

The first of these stages is presented below, since it involves the derivation of the reestimation formulae. The remaining two assertions are demonstrated in appendices A and B.


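From equation (30),

$$\frac{\partial \hat{c}_{x,i}}{\partial \bar{\mu}_i} = \frac{\bar{\tau}_i}{K_{x,i}} \tag{31}$$

$$\frac{\partial \hat{c}_{x,i}}{\partial \bar{\tau}_i} = \frac{\bar{\gamma}_i (d_i \bar{\mu}_i - O)}{K_{x,i}^2} \tag{32}$$

$$\frac{\partial \hat{c}_{x,i}}{\partial \bar{\gamma}_i} = \frac{\bar{\tau}_i (O - d_i \bar{\mu}_i)}{K_{x,i}^2} \tag{33}$$

where $K_{x,i} = \bar{\tau}_i + d_i \bar{\gamma}_i$ and $O = \sum_{t=t_{i-1}+1}^{t_i} y_t$.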

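From equation (12),

$$\log \hat{P}(y, x \mid \bar{\mathcal{M}}) = \sum_{i=1}^{N} \log \left( \bar{a}_{\sigma_{i-1}\sigma_i} \hat{P}_{\sigma_i}(y_{t_{i-1}+1}^{t_i}) \right) \tag{34}$$

$$= \sum_{i=1}^{N} \log \bar{a}_{\sigma_{i-1}\sigma_i} + \sum_{i=1}^{N} \log \hat{P}_{\sigma_i}(y_{t_{i-1}+1}^{t_i}) \tag{35}$$

And from (2),

$$\log \hat{P}_{\sigma_i}(y_{t_{i-1}+1}^{t_i}) = \log \left( D_i(d_i)\, \mathcal{N}_{(\bar{\mu}_i, \bar{\gamma}_i)}(\hat{c}_{x,i}) \prod_{t=t_{i-1}+1}^{t_i} \mathcal{N}_{(\hat{c}_{x,i}, \bar{\tau}_i)}(y_t) \right) \tag{36}$$

$$= \log D_i(d_i) + \log \mathcal{N}_{(\bar{\mu}_i, \bar{\gamma}_i)}(\hat{c}_{x,i}) + \sum_{t=t_{i-1}+1}^{t_i} \log \mathcal{N}_{(\hat{c}_{x,i}, \bar{\tau}_i)}(y_t) \tag{37}$$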


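Derivation of $\bar{\mu}_i$

From equations (26), (35) and (37),

$$\frac{\partial}{\partial \bar{\mu}_i} Q(\mathcal{M}, \bar{\mathcal{M}}) = \sum_{x} \hat{P}(y, x \mid \mathcal{M}) \frac{\partial}{\partial \bar{\mu}_i} \log \hat{P}(y, x \mid \bar{\mathcal{M}}) \tag{38}$$

$$= \sum_{x} \hat{P}(y, x \mid \mathcal{M}) \sum_{j=1}^{N} \frac{\partial}{\partial \bar{\mu}_i} \log \hat{P}_{\sigma_j}(y_{t_{j-1}+1}^{t_j}) \tag{39}$$

$$= \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \frac{\partial}{\partial \bar{\mu}_i} \log \hat{P}_{\sigma_i}(y_{t_{i-1}+1}^{t_i}) \tag{40}$$

Now, using (31),

$$\frac{\partial}{\partial \bar{\mu}_i} \log \mathcal{N}_{(\bar{\mu}_i, \bar{\gamma}_i)}(\hat{c}_{x,i}) = -\frac{(\hat{c}_{x,i} - \bar{\mu}_i)}{\bar{\gamma}_i} \left( \frac{\bar{\tau}_i}{K_{x,i}} - 1 \right) \tag{41}$$

$$= \frac{d_i (\hat{c}_{x,i} - \bar{\mu}_i)}{K_{x,i}} \tag{42}$$

and

$$\frac{\partial}{\partial \bar{\mu}_i} \log \mathcal{N}_{(\hat{c}_{x,i}, \bar{\tau}_i)}(y_t) = \frac{y_t - \hat{c}_{x,i}}{K_{x,i}} \tag{43}$$

Therefore,

$$\frac{\partial}{\partial \bar{\mu}_i} Q(\mathcal{M}, \bar{\mathcal{M}}) = \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \frac{1}{K_{x,i}} \left( d_i (\hat{c}_{x,i} - \bar{\mu}_i) + \sum_{t=t_{i-1}+1}^{t_i} (y_t - \hat{c}_{x,i}) \right) \tag{44}$$

Setting the partial derivative to zero, to obtain a critical point, and multiplying through by $K_{x,i}$ gives,

$$0 = \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \left( d_i (\hat{c}_{x,i} - \bar{\mu}_i) + \sum_{t=t_{i-1}+1}^{t_i} (y_t - \hat{c}_{x,i}) \right) \tag{45}$$

$$= \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \left( \sum_{t=t_{i-1}+1}^{t_i} y_t - d_i \bar{\mu}_i \right) \tag{46}$$

It follows that,

$$\bar{\mu}_i = \frac{\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \sum_{t=t_{i-1}+1}^{t_i} y_t}{\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M})\, d_i} \tag{47}$$

which is the required result.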
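The per-segment derivative used in this derivation can be checked numerically. The sketch below is not part of the original analysis. It assumes the 1-dimensional gaussian segment model with optimal target $(\tau\mu + \gamma O)/(\tau + d\gamma)$, omits the duration term (which does not depend on $\mu$), and compares the closed form $(O - d\mu)/(\tau + d\gamma)$ with a central finite difference. Function names are illustrative.

```python
import math


def log_gauss(x, mean, var):
    """Log density of a 1-D gaussian with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def segment_log_prob(segment, mu, gamma, tau):
    """log N_(mu,gamma)(c_hat) + sum_t log N_(c_hat,tau)(y_t), duration term omitted."""
    d = len(segment)
    c_hat = (tau * mu + gamma * sum(segment)) / (tau + d * gamma)
    return log_gauss(c_hat, mu, gamma) + sum(log_gauss(y, c_hat, tau) for y in segment)


def analytic_dmu(segment, mu, gamma, tau):
    """Closed-form derivative with respect to mu: (O - d*mu) / (tau + d*gamma)."""
    d = len(segment)
    return (sum(segment) - d * mu) / (tau + d * gamma)


def numeric_dmu(segment, mu, gamma, tau, h=1e-6):
    """Central finite-difference approximation to the same derivative."""
    upper = segment_log_prob(segment, mu + h, gamma, tau)
    lower = segment_log_prob(segment, mu - h, gamma, tau)
    return (upper - lower) / (2.0 * h)
```

The two values agree to within rounding error, and the derivative vanishes when $\mu$ equals the segment mean $O/d$, which is consistent with the weighted-mean form of the reestimation formula.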

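Derivation of $\bar{\tau}_i$

As in the derivation of $\bar{\mu}_i$,

$$\frac{\partial}{\partial \bar{\tau}_i} Q(\mathcal{M}, \bar{\mathcal{M}}) = \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \frac{\partial}{\partial \bar{\tau}_i} \log \hat{P}_{\sigma_i}(y_{t_{i-1}+1}^{t_i}) \tag{48}$$

But, using (32),

$$\frac{\partial}{\partial \bar{\tau}_i} \log \mathcal{N}_{(\bar{\mu}_i, \bar{\gamma}_i)}(\hat{c}_{x,i}) = \frac{(\bar{\mu}_i - \hat{c}_{x,i})(d_i \bar{\mu}_i - O)}{K_{x,i}^2} \tag{49}$$

Also, again using (32),

$$\frac{\partial}{\partial \bar{\tau}_i} \log \mathcal{N}_{(\hat{c}_{x,i}, \bar{\tau}_i)}(y_t) = -\frac{1}{2\bar{\tau}_i} - \frac{1}{2\bar{\tau}_i^2} \left( \frac{2 \bar{\tau}_i \bar{\gamma}_i (\hat{c}_{x,i} - y_t)(d_i \bar{\mu}_i - O)}{K_{x,i}^2} - (\hat{c}_{x,i} - y_t)^2 \right) \tag{50}$$

from which it follows that,

$$\frac{\partial}{\partial \bar{\tau}_i} \sum_{t=t_{i-1}+1}^{t_i} \log \mathcal{N}_{(\hat{c}_{x,i}, \bar{\tau}_i)}(y_t) = -\frac{d_i}{2\bar{\tau}_i} - \frac{1}{2\bar{\tau}_i^2} \left( \frac{2 \bar{\tau}_i \bar{\gamma}_i (d_i \hat{c}_{x,i} - O)(d_i \bar{\mu}_i - O)}{K_{x,i}^2} - \sum_{t=t_{i-1}+1}^{t_i} (\hat{c}_{x,i} - y_t)^2 \right) \tag{51}$$

Therefore at a critical point, combining equations (48), (49) and (51),

$$0 = \frac{\partial}{\partial \bar{\tau}_i} Q(\mathcal{M}, \bar{\mathcal{M}}) = \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \left[ \frac{(\bar{\mu}_i - \hat{c}_{x,i})(d_i \bar{\mu}_i - O)}{K_{x,i}^2} - \frac{d_i}{2\bar{\tau}_i} - \frac{1}{2\bar{\tau}_i^2} \left( \frac{2 \bar{\tau}_i \bar{\gamma}_i (d_i \hat{c}_{x,i} - O)(d_i \bar{\mu}_i - O)}{K_{x,i}^2} - \sum_{t=t_{i-1}+1}^{t_i} (\hat{c}_{x,i} - y_t)^2 \right) \right] \tag{52}$$

Consider the term in square brackets. Multiplying by $-1$ and rearranging gives,

$$\frac{(d_i \bar{\mu}_i - O)}{K_{x,i}^2} \left\{ (\hat{c}_{x,i} - \bar{\mu}_i) + \frac{\bar{\gamma}_i}{\bar{\tau}_i} (d_i \hat{c}_{x,i} - O) \right\} + \frac{d_i}{2\bar{\tau}_i} - \frac{1}{2\bar{\tau}_i^2} \sum_{t=t_{i-1}+1}^{t_i} (\hat{c}_{x,i} - y_t)^2 \tag{53}$$

Multiplying by $\bar{\tau}_i$ and expanding the terms in curly brackets gives

$$\frac{(d_i \bar{\mu}_i - O)}{K_{x,i}^2} \left\{ \bar{\tau}_i \hat{c}_{x,i} - \bar{\tau}_i \bar{\mu}_i + \bar{\gamma}_i d_i \hat{c}_{x,i} - \bar{\gamma}_i O \right\} + \frac{d_i}{2} - \frac{1}{2\bar{\tau}_i} \sum_{t=t_{i-1}+1}^{t_i} (\hat{c}_{x,i} - y_t)^2 \tag{54}$$

Now consider the term in curly brackets. This can be rewritten as,

$$-\bar{\tau}_i \bar{\mu}_i - \bar{\gamma}_i O + \hat{c}_{x,i} (\bar{\tau}_i + d_i \bar{\gamma}_i) = -(\bar{\tau}_i \bar{\mu}_i + \bar{\gamma}_i O) + (\bar{\tau}_i \bar{\mu}_i + \bar{\gamma}_i O) = 0$$

from the definition of $\hat{c}_{x,i}$. Therefore equation (52) reduces to

$$0 = \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \left[ \frac{d_i}{2} - \frac{1}{2\bar{\tau}_i} \sum_{t=t_{i-1}+1}^{t_i} (\hat{c}_{x,i} - y_t)^2 \right] \tag{55}$$

From which it follows that,

$$\bar{\tau}_i = \frac{\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \sum_{t=t_{i-1}+1}^{t_i} (\hat{c}_{x,i} - y_t)^2}{\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M})\, d_i} \tag{56}$$

Derivation of $\bar{\gamma}_i$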


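As in the derivation of $\bar{\tau}_i$, the terms which arise from the dependence of $\hat{c}_{x,i}$ on $\bar{\gamma}_i$ cancel, by the definition of $\hat{c}_{x,i}$.

Therefore equation (57) reduces to,

$$0 = \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \left[ 1 - \frac{1}{\bar{\gamma}_i} (\bar{\mu}_i - \hat{c}_{x,i})^2 \right] \tag{63}$$

It follows that

$$\bar{\gamma}_i = \frac{\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) (\bar{\mu}_i - \hat{c}_{x,i})^2}{\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M})} \tag{64}$$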

This concludes the derivation of the reestimation formulae for parameters $\bar{\mu}_i$, $\bar{\tau}_i$ and $\bar{\gamma}_i$.

8.2  Remarks  on  the  derivation  of  the  reestimation  formulae 

As with the standard reestimation formula for the variance of a gaussian state in a conventional HMM, the reestimated value of the state mean $\bar{\mu}$ appears on the right hand side of equation (28). However, because $\hat{c}_{x,i}$ is a function of $\bar{\gamma}$, the term $\bar{\gamma}$ appears on both sides of equation (28). For the purposes of implementation it is natural to use the old values $\mu$ and $\gamma$ on the right-hand side of equation (28). The implications of this will be investigated experimentally. The analogous remarks hold for the quantity $\bar{\tau}$ in equation (29).
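One way of reading the implementation remark above is sketched below for a single state in the 1-dimensional case. This is an assumption-laden illustration, not the memorandum's implementation: the sums over state sequences are represented by an explicit list of `(weight, segment)` pairs, where `weight` stands for $\hat{P}(y, x \mid \mathcal{M})$ and `segment` holds the observations assigned to the state by $x$, and the old values of $\mu$, $\gamma$ and $\tau$ are used everywhere on the right-hand sides of the variance formulae. The name `reestimate_state` is hypothetical.

```python
def reestimate_state(weighted_segments, mu, gamma, tau):
    """One reestimation step for the segment-model parameters of one state.

    weighted_segments : list of (weight, segment) pairs; weight plays the role
                        of P^(y, x | M) and segment is a list of 1-D observations
    mu, gamma, tau    : old parameter values, used on the right-hand sides
    Returns (new_mu, new_gamma, new_tau).
    """
    sum_w = 0.0        # sum of weights
    sum_wd = 0.0       # sum of weight * duration
    sum_wy = 0.0       # sum of weight * (sum of observations in the segment)
    sum_w_target = 0.0  # sum of weight * (c_hat - mu)^2
    sum_w_intra = 0.0   # sum of weight * sum_t (c_hat - y_t)^2
    for w, seg in weighted_segments:
        d = len(seg)
        total = sum(seg)
        # Optimal target computed from the OLD parameter values (assumption).
        c_hat = (tau * mu + gamma * total) / (tau + d * gamma)
        sum_w += w
        sum_wd += w * d
        sum_wy += w * total
        sum_w_target += w * (c_hat - mu) ** 2
        sum_w_intra += w * sum((c_hat - y) ** 2 for y in seg)
    return sum_wy / sum_wd, sum_w_target / sum_w, sum_w_intra / sum_wd
```

In a practical system the weights would be accumulated by a forward-backward style recursion rather than by enumerating state sequences; the explicit list is used here only to keep the correspondence with the formulae visible.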

The assumptions

(i) $\bar{\gamma}_i > \bar{\tau}_i$ for all $i$, and

(ii) the sequence $y = y_1, \ldots, y_T$ is not constant

are sufficient to ensure that the critical point of the auxiliary function is a unique maximum. It is noted in section B.2 that the way in which these assumptions are used in the proof suggests that they may also be necessary.


9  Conclusions 

This  memorandum  has  presented  a  new  segmental  HMM  which  addresses  some  of  the 
limitations  of  conventional  HMMs  in  the  context  of  speech  pattern  modelling.  The  main 
features  of  the  new  model  are: 

•  The  use  of  an  underlying  semi-Markov  process  to  model  speech  patterns  at  the 
segment  level,  and 

•  A segment model in which separate processes are used to model extra-segment and intra-segment variability.

It has been shown that the model is computationally useful to the extent that it admits extensions of the conventional HMM classification and parameter estimation algorithms.

It  has  been  shown  that  segmental  HMMs  can  be  regarded  as  an  extension  and  integration 
of  conventional  variable  frame  rate  analysis  and  hidden  Markov  modelling.  In  addition, 
the  relationship  between  gaussian  segmental  HMMs  and  continuous  gaussian  mixture 
HMMs  has  been  explored. 

Segmental  HMMs  ensure  that  extra-segmental  factors,  such  as  choice  of  acoustic  target 
or  identity  of  speaker,  are  fixed  throughout  a  segment  rather  than  being  allowed  to 
vary  in  synchrony  with  the  speech  pattern  feature  vectors  as  in  a  conventional  HMM. 
At present, they do not ensure that factors such as identity of speaker are preserved between segments, nor do they model the dynamic nature of speech patterns. However, it is hoped that by identifying the target pdfs more closely with parameters which reflect these factors, for example articulatory parameters, the type of segmental HMMs described here can be extended to address these issues in a similar manner to that described in [1].


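A Proof of the concavity of $Q(\mathcal{M}, \bar{\mathcal{M}})$

Claim In the notation of section 8, if $\bar{\gamma}_i > \bar{\tau}_i$ for all $i$, then $Q(\mathcal{M}, \bar{\mathcal{M}})$ is strictly concave in $\bar{\mathcal{M}}$.

Proof

It is sufficient to show that

$$\frac{\partial^2}{\partial \lambda^2} Q(\mathcal{M}, \bar{\mathcal{M}}) < 0 \tag{65}$$

for $\lambda = \bar{\mu}_i$, $\bar{\gamma}_i$ and $\bar{\tau}_i$, for all $i$.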

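Claim: $\frac{\partial^2}{\partial \bar{\mu}_i^2} Q(\mathcal{M}, \bar{\mathcal{M}}) < 0$

This is straightforward. Differentiating equation (44) gives,

$$\frac{\partial^2}{\partial \bar{\mu}_i^2} Q(\mathcal{M}, \bar{\mathcal{M}}) = -\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \frac{d_i}{K_{x,i}} < 0 \tag{66}$$

since $d_i$, $\bar{\gamma}_i$, $\bar{\tau}_i$ and $K_{x,i}$ are all strictly positive.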

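Claim: $\frac{\partial^2}{\partial \bar{\gamma}_i^2} Q(\mathcal{M}, \bar{\mathcal{M}}) < 0$

From the derivation of equation (63),

$$\frac{\partial}{\partial \bar{\gamma}_i} Q(\mathcal{M}, \bar{\mathcal{M}}) = -\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \frac{1}{2\bar{\gamma}_i} \left( 1 - \frac{(\bar{\mu}_i - \hat{c}_{x,i})^2}{\bar{\gamma}_i} \right) \tag{67}$$

Therefore, differentiating again with respect to $\bar{\gamma}_i$,

$$\frac{\partial^2}{\partial \bar{\gamma}_i^2} Q(\mathcal{M}, \bar{\mathcal{M}}) = -\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \frac{1}{\bar{\gamma}_i^2} \left( -\frac{1}{2} + \frac{(\bar{\mu}_i - \hat{c}_{x,i})^2}{\bar{\gamma}_i} + \frac{\bar{\tau}_i (\bar{\mu}_i - \hat{c}_{x,i})(O - d_i \bar{\mu}_i)}{K_{x,i}^2} \right) \tag{68}$$

Now, it follows from the definition of $\hat{c}_{x,i}$ that

$$(O - d_i \bar{\mu}_i) = -\frac{K_{x,i}}{\bar{\gamma}_i} (\bar{\mu}_i - \hat{c}_{x,i}) \tag{69}$$

Therefore, from equation (68),

$$\frac{\partial^2}{\partial \bar{\gamma}_i^2} Q(\mathcal{M}, \bar{\mathcal{M}}) = -\sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \frac{1}{\bar{\gamma}_i^2} \left( \frac{(\bar{\mu}_i - \hat{c}_{x,i})^2}{\bar{\gamma}_i} \left( 1 - \frac{\bar{\tau}_i}{K_{x,i}} \right) - \frac{1}{2} \right) \tag{70}$$

But, since $\bar{\gamma}_i > \bar{\tau}_i$, it follows that $\bar{\tau}_i / K_{x,i} < 1/2$. Hence

$$\frac{\partial^2}{\partial \bar{\gamma}_i^2} Q(\mathcal{M}, \bar{\mathcal{M}}) < -\frac{1}{2\bar{\gamma}_i^3} \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \left( (\bar{\mu}_i - \hat{c}_{x,i})^2 - \bar{\gamma}_i \right) \tag{71}$$

$$= 0 \tag{72}$$

from the definition of $\bar{\gamma}_i$ (equation (28)).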

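Claim: $\frac{\partial^2}{\partial \bar{\tau}_i^2} Q(\mathcal{M}, \bar{\mathcal{M}}) < 0$

Using the derivation of equation (55), it is seen that

$$\frac{\partial}{\partial \bar{\tau}_i} Q(\mathcal{M}, \bar{\mathcal{M}}) = \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \left[ -\frac{d_i}{2\bar{\tau}_i} + \frac{1}{2\bar{\tau}_i^2} \sum_{t=t_{i-1}+1}^{t_i} (\hat{c}_{x,i} - y_t)^2 \right] \tag{73}$$

It follows that,

$$\frac{\partial^2}{\partial \bar{\tau}_i^2} Q(\mathcal{M}, \bar{\mathcal{M}}) = \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \left[ \frac{d_i}{2\bar{\tau}_i^2} + \left( \frac{1}{\bar{\tau}_i^2} \sum_{t=t_{i-1}+1}^{t_i} (\hat{c}_{x,i} - y_t) \frac{\bar{\gamma}_i (d_i \bar{\mu}_i - O)}{K_{x,i}^2} - \frac{1}{\bar{\tau}_i^3} \sum_{t=t_{i-1}+1}^{t_i} (\hat{c}_{x,i} - y_t)^2 \right) \right] \tag{74}$$

From the definition of $\hat{c}_{x,i}$,

$$d_i \bar{\mu}_i - O = \frac{K_{x,i}}{\bar{\gamma}_i} (\bar{\mu}_i - \hat{c}_{x,i}) \tag{75}$$

and

$$\sum_{t=t_{i-1}+1}^{t_i} (\hat{c}_{x,i} - y_t) = \frac{\bar{\tau}_i}{\bar{\gamma}_i} (\bar{\mu}_i - \hat{c}_{x,i}) \tag{76}$$

Substituting the last two results into equation (74) gives

$$\frac{\partial^2}{\partial \bar{\tau}_i^2} Q(\mathcal{M}, \bar{\mathcal{M}}) = \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \left[ \frac{d_i}{2\bar{\tau}_i^2} + \frac{(\bar{\mu}_i - \hat{c}_{x,i})^2}{\bar{\tau}_i \bar{\gamma}_i K_{x,i}} - \frac{1}{\bar{\tau}_i^3} \sum_{t=t_{i-1}+1}^{t_i} (\hat{c}_{x,i} - y_t)^2 \right] \tag{77}$$

$$< \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M}) \left[ \frac{d_i}{2\bar{\tau}_i^2} + \frac{(\bar{\mu}_i - \hat{c}_{x,i})^2}{2 \bar{\tau}_i^2 \bar{\gamma}_i} - \frac{1}{\bar{\tau}_i^3} \sum_{t=t_{i-1}+1}^{t_i} (\hat{c}_{x,i} - y_t)^2 \right] \tag{78}$$

$$= \frac{1}{\bar{\tau}_i^2} \left( \frac{V}{2} + \frac{Z}{2} - V \right) \tag{79}$$

$$< 0 \tag{80}$$

where $V = \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M})\, d_i$ and $Z = \sum_{x \in S_i} \hat{P}(y, x \mid \mathcal{M})$. Inequality (78) holds because it is assumed that $\bar{\gamma}_i > \bar{\tau}_i$ (so that $K_{x,i} > 2\bar{\tau}_i$). Equation (79) follows from the definitions of $\bar{\gamma}_i$ (28) and $\bar{\tau}_i$ (29). The final inequality follows from the fact that $V > Z$.

B $Q(\mathcal{M}, \bar{\mathcal{M}}) \to -\infty$ as $\bar{\mathcal{M}}$ approaches the boundary of the parameter space

This section uses the notation of the main body of the memorandum. The assumptions that $\bar{\gamma}_i > \bar{\tau}_i$ and that the observation sequence $y = y_1, \ldots, y_T$ is not constant are both used in the proof.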


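B.1 Proof

Focussing again on the parameters $\bar{\mu}_i$, $\bar{\gamma}_i$ and $\bar{\tau}_i$, the cases which must be considered are:

•  $\bar{\mu}_i \to \pm\infty$

•  $\bar{\gamma}_i \to 0$ or $\bar{\gamma}_i \to \infty$

•  $\bar{\tau}_i \to 0$ or $\bar{\tau}_i \to \infty$

From equation (26),

$$Q(\mathcal{M}, \bar{\mathcal{M}}) = \sum_{x} \hat{P}(y, x \mid \mathcal{M}) \log \hat{P}(y, x \mid \bar{\mathcal{M}}) = \sum_{x} \hat{P}(y, x \mid \mathcal{M}) \sum_{i=1}^{N} \left[ \log \bar{a}_{\sigma_{i-1}\sigma_i} + \log D_i(d_i) + \log \mathcal{N}_{(\bar{\mu}_i, \bar{\gamma}_i)}(\hat{c}_{x,i}) + \sum_{t=t_{i-1}+1}^{t_i} \log \mathcal{N}_{(\hat{c}_{x,i}, \bar{\tau}_i)}(y_t) \right] \tag{81}$$

Hence it is sufficient to show that

$$\log \mathcal{N}_{(\bar{\mu}_i, \bar{\gamma}_i)}(\hat{c}_{x,i}) = -\frac{1}{2} \log(2\pi) - \frac{1}{2} \log(\bar{\gamma}_i) - \frac{1}{2\bar{\gamma}_i} (\bar{\mu}_i - \hat{c}_{x,i})^2 \tag{82}$$

and

$$\sum_{t=t_{i-1}+1}^{t_i} \log \mathcal{N}_{(\hat{c}_{x,i}, \bar{\tau}_i)}(y_t) = \sum_{t=t_{i-1}+1}^{t_i} \left[ -\frac{1}{2} \log(2\pi) - \frac{1}{2} \log(\bar{\tau}_i) - \frac{(\hat{c}_{x,i} - y_t)^2}{2\bar{\tau}_i} \right] \tag{83}$$

tend to $-\infty$ in all of the cases listed above.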


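Case 1: $\bar{\mu}_i \to \pm\infty$

This is straightforward. To see that (82) tends to $-\infty$ note that

$$\frac{(\bar{\mu}_i - \hat{c}_{x,i})^2}{2\bar{\gamma}_i} = \frac{(\bar{\mu}_i d_i - O)^2 \bar{\gamma}_i}{2(\bar{\tau}_i + d_i \bar{\gamma}_i)^2} \tag{84}$$

$$\to \infty \text{ as } \bar{\mu}_i \to \pm\infty \tag{85}$$

and, in the case of equation (83),

$$\frac{(\hat{c}_{x,i} - y_t)^2}{2\bar{\tau}_i} = \frac{(\bar{\tau}_i \bar{\mu}_i + O \bar{\gamma}_i - y_t(\bar{\tau}_i + d_i \bar{\gamma}_i))^2}{2(\bar{\tau}_i + d_i \bar{\gamma}_i)^2 \bar{\tau}_i} \tag{86}$$

$$\to \infty \text{ as } \bar{\mu}_i \to \pm\infty \tag{87}$$

Hence, since all other terms in (82) and (83) are independent of $\bar{\mu}_i$, $Q(\mathcal{M}, \bar{\mathcal{M}}) \to -\infty$ as $\bar{\mu}_i \to \pm\infty$.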

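Case 2.1: $\bar{\gamma}_i \to 0$

As above

$$\frac{(\bar{\mu}_i - \hat{c}_{x,i})^2}{2\bar{\gamma}_i} = \frac{(\bar{\mu}_i d_i - O)^2 \bar{\gamma}_i}{2(\bar{\tau}_i + d_i \bar{\gamma}_i)^2} \tag{88}$$

$$> \frac{(\bar{\mu}_i d_i - O)^2}{2(d_i + 1)^2 \bar{\gamma}_i} \tag{89}$$

because $\bar{\gamma}_i > \bar{\tau}_i$. Hence $\log \mathcal{N}_{(\bar{\mu}_i, \bar{\gamma}_i)}(\hat{c}_{x,i}) \to -\infty$ as $\bar{\gamma}_i \to 0$, because $1/\bar{\gamma}_i \to \infty$ faster than $\log(\bar{\gamma}_i) \to -\infty$ as $\bar{\gamma}_i \to 0$. Similarly, from (86),

$$\frac{(\hat{c}_{x,i} - y_t)^2}{2\bar{\tau}_i} = \frac{(\bar{\tau}_i \bar{\mu}_i + O \bar{\gamma}_i - y_t(\bar{\tau}_i + d_i \bar{\gamma}_i))^2}{2(\bar{\tau}_i + d_i \bar{\gamma}_i)^2 \bar{\tau}_i} \tag{90}$$

$$> \frac{(\bar{\mu}_i \bar{\tau}_i + \bar{\gamma}_i O - y_t(\bar{\tau}_i + d_i \bar{\gamma}_i))^2}{2(d_i + 1)^2 \bar{\gamma}_i^3} \tag{91}$$

$$\to \infty \text{ as } \bar{\gamma}_i \to 0 \tag{92}$$

Hence $Q(\mathcal{M}, \bar{\mathcal{M}}) \to -\infty$ as $\bar{\gamma}_i \to 0$.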

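Case 2.2: $\bar{\gamma}_i \to \infty$

From equation (84),

$$\frac{(\bar{\mu}_i - \hat{c}_{x,i})^2}{2\bar{\gamma}_i} = \frac{(\bar{\mu}_i d_i - O)^2 \bar{\gamma}_i}{2(\bar{\tau}_i + d_i \bar{\gamma}_i)^2} \tag{93}$$

$$\to 0 \text{ as } \bar{\gamma}_i \to \infty \tag{94}$$

Therefore $\log \mathcal{N}_{(\bar{\mu}_i, \bar{\gamma}_i)}(\hat{c}_{x,i}) \to -\infty$ as $\bar{\gamma}_i \to \infty$. Also, since

$$\hat{c}_{x,i} = \frac{\bar{\mu}_i \bar{\tau}_i + O \bar{\gamma}_i}{\bar{\tau}_i + d_i \bar{\gamma}_i} \to \frac{O}{d_i} \tag{95}$$

it follows that $\sum_{t=t_{i-1}+1}^{t_i} \log \mathcal{N}_{(\hat{c}_{x,i}, \bar{\tau}_i)}(y_t)$ is bounded as $\bar{\gamma}_i \to \infty$.

Hence $Q(\mathcal{M}, \bar{\mathcal{M}}) \to -\infty$ as $\bar{\gamma}_i \to \infty$, as required.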


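Case 3.1: $\bar{\tau}_i \to 0$

Since

$$\hat{c}_{x,i} \to \frac{O}{d_i} \text{ as } \bar{\tau}_i \to 0 \tag{96}$$

it follows that $\log \mathcal{N}_{(\bar{\mu}_i, \bar{\gamma}_i)}(\hat{c}_{x,i})$ is bounded as $\bar{\tau}_i \to 0$. This leaves the term $\sum_{t=t_{i-1}+1}^{t_i} \log \mathcal{N}_{(\hat{c}_{x,i}, \bar{\tau}_i)}(y_t)$. The relevant contribution of this factor to $Q(\mathcal{M}, \bar{\mathcal{M}})$ consists of weighted sums of the terms $-\log(\bar{\tau}_i)$ and $-(\hat{c}_{x,i} - y_t)^2 / \bar{\tau}_i$. But

$$(\hat{c}_{x,i} - y_t)^2 \to \left( \frac{O}{d_i} - y_t \right)^2 \text{ as } \bar{\tau}_i \to 0 \tag{97}$$

Hence, provided that $O / d_i \neq y_t$ for some $t$, $Q(\mathcal{M}, \bar{\mathcal{M}})$ will tend to $-\infty$ as $\bar{\tau}_i \to 0$. This accounts for the assumption that the observation sequence $y = y_1, \ldots, y_T$ is not constant.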

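Case 3.2: $\bar{\tau}_i \to \infty$

As $\bar{\tau}_i \to \infty$, $\bar{\gamma}_i \to \infty$, since $\bar{\gamma}_i > \bar{\tau}_i$ by assumption. The argument for case 2.2 above then shows that $\log \mathcal{N}_{(\bar{\mu}_i, \bar{\gamma}_i)}(\hat{c}_{x,i}) \to -\infty$ as $\bar{\tau}_i \to \infty$.

Finally,

$$\frac{(\hat{c}_{x,i} - y_t)^2}{2\bar{\tau}_i} = \frac{(\bar{\tau}_i \bar{\mu}_i + O \bar{\gamma}_i - y_t(\bar{\tau}_i + d_i \bar{\gamma}_i))^2}{2\bar{\tau}_i (\bar{\tau}_i + d_i \bar{\gamma}_i)^2} \tag{98}$$

$$\to 0 \text{ as } \bar{\tau}_i \to \infty \tag{99}$$

because $\bar{\gamma}_i > \bar{\tau}_i$. Hence $\sum_{t=t_{i-1}+1}^{t_i} \log \mathcal{N}_{(\hat{c}_{x,i}, \bar{\tau}_i)}(y_t) \to -\infty$ as $\bar{\tau}_i \to \infty$. It follows that $Q(\mathcal{M}, \bar{\mathcal{M}}) \to -\infty$ as $\bar{\tau}_i \to \infty$ as claimed.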


B.2 Remarks

The above arguments suggest that the condition $\bar{\gamma}_i > \bar{\tau}_i$ may be necessary as well as sufficient. In case 2.1, if it is not the case that $\bar{\tau}_i \to 0$ as $\bar{\gamma}_i \to 0$, then it is possible for the terms in equations (88) and (90) to be bounded above as $\bar{\gamma}_i \to 0$. In this case the term $-\log(\bar{\gamma}_i)$ will dominate and $Q(\mathcal{M}, \bar{\mathcal{M}})$ will tend to infinity as $\bar{\gamma}_i \to 0$. Consequently the point identified by the reestimation formulae will no longer necessarily be a maximum of the auxiliary function.